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Abstract 

Motivated by a challenging problem in financial trading we are presented with a mixture of regres- 



sions with variable selection problem. In this regard, one is faced with data which possess outliers, 
skewness and, simultaneously, due to the nature of financial trading, one would like to be able to con- 
p ^ struct clusters with specific predictors that are fairly sparse. We develop a Bayesian mixture of lasso 

regressions with t— errors to reflect these specific demands. The resulting model is necessarily complex 
and to fit the model to real data, we develop a state-of-the-art Particle Markov chain Monte Carlo 

^— > 

CZ) (PMCMC) algorithm based upon sequential Monte Carlo (SMC) methods. The model and algorithm 

are investigated on both simulated and real data. 
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In the following article, we will consider a Bayesian mixture of lasso regressions with t— errors that is 
motivated by a particular problem in finance. The specifics of the data are explained in Section 5, but the 
model and resulting MCMC algorithm are generic and hence we consider a general presentation during 
the article. The data we are presented with is a collection of n e N + paired observations V n = (a3j,j/j)™ =1 
where j/i £ I is the response variable and Xi e W is the corresponding vector of explanatory variables. 
The specific objective is to cluster linear regression curves which satisfy the following constraints: 

• The regression curves are resistant to outliers 

• Each regression curve is specific to each cluster, in that the predictors for one curve may not be 
present in another 

• One would like to have relatively few predictors in each curve 

It is remarked that whilst we have been motivated by a problem in finance, this particular scenario is 
present in other real problems, such as gene expression data; see for example Cozzini et al. (2011). 
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In the context of the problem, we will then consider a mixture of regressions, from the Bayesian 
perspective. Mixture of regressions has been well studied; see for example Goldfield & Quandt (1973) and 
Hurn ct al. (2003). In addition, the issue of variable selection has also been substantially investigated, 
both in the supervised and unsupervised mixture modelling setting, by Raftery & Dean (2006) and Yau 
& Holmes (2011), for example. However, to our knowledge, there are very few articles which develop a 
(Bayesian) model with component specific variable selection, which we incorporate into our model. An 
exception in the case of frequentist statistics is Khalili & Chen (2007). We remark that we do not address 
the issue of selecting the number of components in the mixture, but this is discussed in Section 6. To deal 
with the issues of robustness to outliers and sparseness of solutions, we consider well-known procedures, by 
incorporating heavy-tailed regression error as well as a Bayesian lasso type structure (e.g. Park & Casella 
(2008)). The latter idea has also been followed by Yau & Holmes (2011). 

These model components lead to a Bayesian statistical model which is very high-dimensional. In order 
to draw statistical inference, after marginalization, we are left with a posterior distribution on the class 
labels of the mixture, component specific variable selection indicators and some additional parameters. 
Due to the complexity of the resulting posterior, very sophisticated computational tools are required. We 
focus on using PMCMC (Andrieu et al. 2010), which is particularly useful for statistical models with latent 
variables. The PMCMC algorithm uses an SMC algorithm (e.g. Doucet et al. (2001)) to update latent 
variables: we focus on the class labels which have a larger state-space than the variable selection in the 
examples considered. We develop an SMC algorithm and subsequently a conditional SMC algorithm for our 
particle Gibbs algorithm (a special case of PMCMC). The PMCMC algorithm reflects the current state- 
of-the-art in Bayesian computation and gives us the best chance of reliable inference from the posterior; 
although we remark that it is far from infallible and can break down for sufficiently complex problems. 

The outline of the paper is as follows. In Section 2 we describe the hierarchical representation of the 
model and justify the choice of priors that lead to the posteriors of interest. In Section 3 we present our 
PMCMC algorithm. In Section 4 we investigate the model and algorithm on simulated data. In Section 5 
we describe the applied problem and the data which we analyze. In Section 6 the article is concluded and 
some avenues of future work are discussed. 



2 The Model 



2.1 Set-Up 

Generalising the peculiarities of the financial data we want to investigate, let us first highlight the relevant 
aspects of the problem that motivate the mixture of regression model we propose. 
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Recall we have a collection of n € N + paired observations V n = (a;*, J/i)?=i where € R is the response 
variable and x, e K p is the corresponding vector of explanatory variables. To simplify notation we use 
%i-.p,i to indicate the collection of covariates at the i th sample and set the first element xij to be 1 to 
allow a more convenient formulation of the model. The defining characteristic of the data is that the n 
samples are generated from a heterogeneous population and only few of the p covariates convey any useful 
information to explain the variability of the j/j. 

To answer these demanding conditions, we propose a Bayesian mixture model which postulates that 
there arc K < n possible linear regression curves (one can consider more general basis function, but this is 
not done here) to describe the data and that each curve potentially depends upon a different collection of 
the variables 1, . . . ,p. To facilitate the derivation of a sparse solution, we introduce a p-dimensional binary 
vector 7* :p , where we use 7* to denote (7*, ... ,7^), which encodes whether each of p observed covariates 
should be included or not in the k th regression curve for k = 1, . . . , K. Similarly, we use 7* as a subscript 
indicator which deletes the elements corresponding to 7^ = for d € {!,... ,p} and returns a vector of 
length |7f :p |i (Li-norm). 

The mixture model is then defined as the conditional distribution of yi given Xi 

K 

(1) Vi\Xi, (3, W, S i: 7i: P ~ ^2 w k-^{^k i^k ,s^) 

fc=l 

where Mi (/x, S) is the I— dimensional normal distribution of mean \x and covariance E. Note that, to simplify 
notation, when I = 1 we drop the subscript. (1) is a mixture of normal distributions with parameters 

• Wk with < Wk < 1 for k = {1, . . . , K} such that Y^k=i w k = ^' are *he mixing proportion of the K 
components. 

• Pi-.p w ith € R for d = {1, . . . ,p}, is the collection of regression coefficients. 

• s k , with e R + for i = {1, . . . ,n} is a variable introduced to allow a Student t— regression error. 

Having defined the model, the values of the parameters * = (w, (3, s, 7) are unknown and will have 
to be inferred from the data D n using a Bayesian approach. 

Note that throughout our discussion we assume that the number of clusters K is known. In a different 
situation, we could have included K in the set of unknown parameters and modified the estimation process 
accordingly. While this would be a standard procedure, it adds another level of complexity to the model 
that we rather avoid here since it is not the focus of our investigation; see Section 6 for some discussion. 

2.2 Hierarchical Specification 

Whilst a mixture of Gaussian distributions as described in (I) is a fairly general model, it is also flexible 
enough to allow us to choose convenient priors that achieve the objective of making the model robust 
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to outliers and selecting only the relevant covariates. This task is facilitated by using a hierarchical 
representation of the mixture model and having different levels of priors and hyperpriors. 

Following the standard missing data approach, see Diebolt & Robert (1994), we introduce, for every 
i th — data point, the latent allocation variable Zi £ {1, . . . , K} which indicates the membership of yi to the 
k th ~ cluster. Thus, we can simplify the mixture structure and note that the conditional distribution of yi 
given zi — k, with probability p(zi = k) = Wk, is the Gaussian distribution 

(2) 1 7i fe P , xK , p k , si z t = k~ M(x' . /& , s k ). 

1 <l:p flip flip ' ' l:p 

Assuming the mixture weights follow a Dirichlet distribution, the prior on wi-.k-i is Wi-.k-i ~ T>ir{5) 
where Vir{8) is the symmetric Dirichlet distribution. 

2.2.1 Distribution of s k 

Following the hierarchical representation, given Zi — k, the distribution of the variance parameter s k in 

(2) , is set to be 

s k - Qa(d/2, d/2) 

where Qa{a,b) is Gamma distribution of mean a/b. The hyperparameter d corresponds to the degrees of 
freedom of the student-i distribution. 

2.2.2 The Bayesian Lasso 

A very important feature of the model we propose is that it combines, in a mixture framework, shrinkage 
and variable selection. It achieves this result by adopting specific priors for the regression coefficients /3 
and the binary indicator variables 7. 

Tibshirani (1996) showed that using a ML approach in a single mixture component framework, one 
can regularise the estimated linear regression coefficients (3 k . p introducing the penalty term: h\(/3 k . p ) = 
Y^id=i \Pd\ 9 f° r some q > and € M + . The effect of penalising the likelihood function is to shrink the 
vector of MLE of (3 k . p toward zero with the possibility of setting some coefficients exactly equal to zero. 

It is well known that similar results to the Lasso penalty can be achieved by assuming that j3 k . p have 
independent Laplace, i.e. double-exponential priors, 

(3) P(rt. P \4) = ft o*r exp [ zh 0] 

where u\ e R + determines the scaling of the regression coefficients in the k th — curve and e R + is the 
smoothness parameter that controls the tail decay. Since the mass of (3) is quite highly concentrated 
around zero with a distinct peak at zero, the regression coefficient estimates corresponding to the posterior 
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mean and posterior mode are shrunk towards zero in equivalent fashion to the penalisation least squares 
estimation procedure. 

The double-exponential distribution can be represented as a scale mixture of normals with exponential 
mixing distribution. Therefore, introducing a latent vector of scale variables we obtain a more tractable 
hierarchical formulation of the prior on /3±. p . Ignoring for the moment the 7 1:p indicator and assuming 
a single component mixture, consider the following hierarchical prior on the d th regression coefficient: 
/3d|rJ,A <~ A/"(0, rj) where the hyperparameter tJ itself has hyperprior rj <~ £x(X 2 /2) (£x(a) is the 
exponential distribution of mean 1/a.). We note that marginally (3^ still follows a Laplace distribution 
with parameter A, p{fi d ) = J °° p(/3d|rj) p{rj) drj cx exp(-A 

The modular structure of hierarchical modelling allows us to extend, in a straightforward way, the 
Bayesian Lasso method to our proposed mixture of linear regression. Together with the prior on (3 k k we 

Tl:p 

also specify priors on the hyperparameters Ufe, with Ofc G M + , to control the scaling, and rf. p , with r| € M + , 
to induce shrinkage on the coefficients of the k th regression curve. 

^Jalr^^.p ~ ^ 7Ul (o,^diag(r 7 2 £)) 
a\ ~ XQa(a, b) 
r?| 7l fe p ^ ^(A 2 /2). 

' l:p 

XQa(a, b) is the Inverse-Gamma distribution of mean b/ (a — 1) (a > 1). Whilst in our discussion we assume 
A is given, Park & Casella (2008) have shown, in a non-mixture Bayesian framework with known ji-.p, that 
the Lasso parameter can be chosen by marginal maximum likelihood or using an appropriate hyperprior. 

2.3 Variable Selection 

Many authors, such as George & McCulloch (1997), Kim et al. (2006) and Schafer & Chopin (2012) have 
proposed an effective solution by once again specifying a convenient prior for the selection indicator 7^. 
We specify selection priors that fit into the mixture framework of regularised regressions. A suitable prior 
for 7^ is the Bernoulli distribution Be(<p) mutually independent across independent components. 

We should also point out the level of the flexibility of the mixture model. By making 7* cluster 
specific, each regression curve can be a function of its own different set of covariates. On the other hand, 
the combinations of competing models to be evaluated grows exponentially with the number of explanatory 
variables and linearly with the clusters, K2 P . In theory, for the given prior, we could compute the posterior 
probability of each model before selecting the best one. In practice, it is evident that a full exploratory 
search is unfeasible and we need to incorporate a selection procedure into the sampling algorithm. 
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2.4 Posterior Distribution 

The hierarchical representation of our Bayesian model can be observed in Figure 1. We have also discussed 
in the previous section how the desired properties of the model are achieved by specifying a convenient 
structure. We now give some details on the posterior of interest 




Figure 1: Directed Acyclic Graph (DAG) showing the hierarchical structure of the priors on the parameters 
of the proposed mixture model. We have drawn a square box around hyperparamcters considered to be a 
known constant, a circle to indicate an latent variables that need to be estimated, and a rectangular box 
to indicate observed data. The arrows indicate the conditional dependence structure of the model. 



Using a synthetic notation to indicate the unknown parameters of the model xjj — (wi-k, <Ji-.k, @i-.k, s%:ni 1i-.p> T i P ), 
and the fixed, assumed known, hyperparameters of the model h = (a, b, A, </>, d, S), we can say that, after 
observing the covariates x = (xi, . . . , x n ) and the responses y = • • ■ , Vn), the posterior distribution of 
ip is 



(4) 



ir(ij>\x, y) oc L(y; x, i/))p(ip\h) 



where L(y;x,tp) is the likelihood function and p(ip\h) the prior distributions we have previously defined. 

Since our main focus is to draw inference on the cluster membership of the observations and identify 
the relevant explanatory variables, we remove as many other variables as possible. We integrate out the 
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parameters /3i-.k, &i-.k and Wi-.k-i in (4) 



n{Zl:n, s l:n, "fl-.p, Ti.p\T> n ) = J 7r(zi :n , Sl :n , 7l :p , rf. p , /3l-.K , (?1:K , Wl:K-l[D n ) d((3\;K, <J\:K , W\ : K-l) 



and obtain the marginal posterior density of interest up to a normalizing constant 

:ni s l:n> Tl:p> ^l-.pl^n) ^ 

(5) 



n k (sin, 7i fc P , r^i^) { n d/2, d/2) } x 

fe=l L I- i=l J 



II ^ fc ;l,A 2 /2) 



nf=ir(* + n fc ) 



r(Ef=iK + <5]) 



where T(-) is the gamma function, <p(x;a,b) is the Gamma density of mean a/b, n k — J2i=i \k}{ z i) the 
number of observations assigned to the k th cluster, V k is the collection of observations assigned to the k th 
cluster. Given zt — k, we can derive 



&(«L>7iV#|X> fc ) = 



|Vfc| 1 /2 7r r lfc /2 r ( a ) 



with 



14 - diag(l,rr) 

7l : p 

"*£ - W 7f!p £-£/) 
Ofe = a + n k /2 

bl = b+^)'V- s l/-{ml)'{VZ)-'ml}/2 

where E s fc = diag(sj , . . . , s*). We should also be aware of label switching problem (e.g. Jasra et al. (2005)) 
which is a common issue when estimating the parameters of a Bayesian mixture model. This is addressed 
in Section 4. 



3 Simulation Methodology 

We adopt an MCMC strategy (see Robert & Casella (2004) for a review) to sample from the target dis- 
tribution. We should first note that, within the mixture modelling literature, there has been work done 
on perfect sampling and direct sampling, making use of the full conditional distributions. For example, 
Mukhopadhyay & Bhattacharya (2011) proposed a perfect sampling methodology for fitting mixture mod- 
els. Fearnhead & Meligkotsidou (2007) instead proposed a direct sampling method that returns independent 
samples from the true posterior. Unfortunately, the described algorithms have limited applicability in our 
context. 
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In light of recent work presented by Andrieu et al. (2010), we adopt a Particle Markov chain Monte 
Carlo (PMCMC) simulation procedure which combines MCMC and SMC methods and takes advantage 
of the strengths of both. The key feature of PMCMC algorithms is that they are in fact exact approx- 
imations of idealised MCMC algorithms, while they use sequential Monte Carlo methods to build high 
dimensional proposal distributions. On the other hand, compared to stand alone SMC, PMCMC sampling 
is more robust to the path degeneracy problem, described later on. More precisely, here we implement a 
particle Metropolis-within-Gibbs algorithm. Below we describe the constituents of the algorithm, which is 
summarized in Section 3.4. 

3.1 Sequential Monte Carlo Algorithm 

SMC methods are a general class of algorithms that use a set of weighted particles to recursively approx- 
imate a sequence of distributions of increasing dimension. It has been originally introduced to deal with 
situations with dynamic observations. Nonetheless, it has demonstrated to be highly effective also in static 
problems like mixture models and it is an integral part of PMCMC. Before illustrating how SMC algorithms 
are used in our sampling procedure, we refer the reader to Doucet et al. (2001) for a detailed review of 
SMC methods. In particular, we assume the reader is familiar with Sequential Importance Sampling (SIS). 

3.1.1 Sampling Cluster Labels 

We use an SMC method to sample sequentially from 7Tj(z 1:i |s 1:i , "f\ :p , Tf. p , T>i) as i increases. Following 
Algorithm 3.1, we first initialize Si :n , 7i :p , Tf. p by sampling their respective priors, and then alternate 
sequential importance sampling and resampling steps. More explicitly, the sequential importance sampling 
targets the full conditional density of the latent labels variables Z\ :i which, after the first data 
points, is 



where V k denotes the data allocated to the k cluster out of the first i observations and rv k = 



3.1.2 Adaptive Resampling 

SIS is subject to the problem of weight degeneracy. As new incoming observations are fed into the algo- 
rithm, the variance of the importance weights typically increases at an exponential rate until all the mass 
concentrates on one single particle, leaving the remaining particles with weights tending to zero. 




Ym=i ^{k}( z i) their total number. 
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To avoid spending a large computational effort to update trajectories whose contribution to the final 
estimate is negligible, we execute a resampling step with the intention of replacing the unpromising lowest 
weighted particles with new particles that hopefully lie in regions of high target density. The exact proce- 
dure consists in sampling N particles from the approximated target distribution to obtain N new particles 
which will then be equally weighted. On the other hand, if one resamples very often, we will rapidly deplete 
the number of distinct particles and the approximation of the target will suffer because the paths of z\-i 
become very similar (path degeneracy). 

To find a balance between weights degeneracy and path degeneracy, Del Moral et al. (2012) among 
others, suggest to resample only when the variance of the unnormalized weights is above a fixed thresh- 
old. In the solution we adopt, the threshold is a function of the Effective Sample Size (ESS) ESS = 
(j2f=i(Wi) 2 ^J which takes values between 1 and N and, as described in Algorithm 3.1, we resample 
only when it is below ESS < N/2. 

It should be noted here that executing the resampling step only when the condition ESS < N/2 
is satisfied, does not alter the property of the algorithm that still returns an unbiased estimate of the 
normalising constant, as noted in a personal communication by C. Andrieu and N. Whiteley - see the work 
of Arnaud & Le Gland (2009). 

3.2 Conditional Sequential Monte Carlo Algorithm 

The conditional SMC algorithm we iterate in the second stage of our sampling procedure is essentially the 
SMC algorithm described in Section 3.1 except it preserves the path of one particle. 

To describe the algorithm, we need to introduce a sequence of indexes b\. n G {1, . . . , N} n to represent 
the genealogy of the t th particle for i € {1, ... , N}. Once we have set b l n — t, the genealogy of t th particle 
can then be defined recursively b\ — a i i _ 1 for i = 1, . . . , n — 1 where the ai :n _i = (a\. n _ 1 , . . . , aff„_ 1 ) are 
the recorded samples from the previous iteration of the SMC algorithm. 

As we can see from the Algorithm 3.2, the sampling sequence is similar to what is implemented in a 
standard SMC algorithm except that one randomly chosen particle t with its ancestral lineage b\. n is fixed 
and ensured to survive, whereas the remaining N — 1 particles are regenerated as usual. 

3.3 Markov Chain Monte Carlo Steps 

In the SMC algorithm we sample from the posterior distribution of the latent label indicator variable z\- n . 
With the MCMC steps our objective is to update the other parameters of the mixture model that control 
the regression error distribution, the regularization of the regression coefficients and the variable selection 
process. The MCMC steps which target the posterior (5) are as follows. 
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Algorithm 3.1 Sequential Monte Carlo Algorithm 

Step 1. Sample A labels, z\,...,z^, from 7r 1 (z 1 |---) and set the corresponding weights W{ — 1 for 
j = l,...,N. 

Step 2. For i = 2, . . . , n repeat the following 



1. If ESS < A/2, for each j = {1, . . . , A} resample a J i _ 1 £ {1, . . . , A} using the discrete distribution 



Wi-i 



w 3 , 

v r I— 1 



E,=i w ?-i 



Otherwise keep all the current particles by a\_ x = j for j £ {1, . . . , AT}. 
2. Sample, for each j e {1, . . . , A}, a label from iTi(zi\ • • • ) where 



7r*(^i! ■ • • ) = 



(*-i)>«2_i 



££=i 6, 7i%, rlfi \V Zi )T{5 + 1 + r^r 1 ^- 1 ) 



and n^^" 1 = £1=1 1 {fc} (^ atl) ). Set *£. = (4"-W)- 



3. Set, for each j G {!,..., N} 



W? = 



T K 


n^i ^,^|i?«)r(* + n« J ) 


/r(£f =1 (np + *)) 




/r(Ef= 1 (4 i - 1)J + ^)) 



and i = i + 1. 



Algorithm 3.2 Conditional Sequential Monte Carlo Algorithm 



Step 1. Sample 1 — A labels z\ from 7Ti(zi| • • • ), for j = 1, . . . , N while j ^ b\ (i.e. excluding j = b\), and 
set all the weights W[ = 1 for j = 1, . . . , N. 
Step 2. For i = 2, . . . , n repeat the following 

1. If ESS < A/2, for each j £ {1, . . . , A} except j — &*, resample a 3 i _ 1 e {1, . . . , A} using the discrete 
distribution 



Otherwise keep all the current particles by a\_ 1 = j. 

2. Sample z\ from ni(zi\ • • • ) for each j e {1, . . . , A} except j = b\, and update the corresponding path 

z l:i — (^l'i-D z i )• 

3. Set W- as for the SMC algorithm, for each j € {1, . . . , A}, (this includes the fixed particle j = b\) 
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3.3.1 Step 1: Update T?. p 

To update the Tf. p , given all the other variables are fixed, we can use the following procedure. For each 
k E {1, . . . ,K}, assuming |7i :p |i > 0, sample for each d where 7^ = 1, (t 2 '*)* = T 2 ' fe expjV T Nd} with 
v T > a user-set parameter and ~ A/"(0, 1), independent for each d. Accept all the (rj' )* with 
probability 

1A ^(^n.Ti^,^)^) n y ((r d 2 - fc )*;l,A 2 /2)(r^)* 
€fc(*L,^= P .Ti% fc |^) d; ^ ^;M 2 /2)r 2 < fc 

otherwise keep the current T 2 p . 

3.3.2 Step 2: Update s i: „ 

To update si :n , given all the other variables are fixed, we can use the following procedure. For each 
i e {1, ...,«}, fc € {l,...,if} propose (sf)* = s- 5 exp{f s iVj} where f s > is a user-set parameter 
(potentially different from the v T above) and Ni <~ A/"(0, 1), independent for each i. Note that (s* :n )* 
features only one changed value from Si. n . The proposed move then is accepted with probability 

1A U{sl. n )\li. P ,Tl*\V k ) n y(( a fc)*;d/2,d/2)(^)* 

otherwise keep the current s k . 

3.3.3 Step 3: Update -y 1:p 

To update 71 :p , given all the other variables are fixed, we can use the following procedure. For each 
d £ {1, . . . ,p}, k £ {1, . . . , if} (i.e. propose to change only one element each time), if 7^ = we propose 
(7^)* = 1 and draw (r|)* from its prior (Qa(l, A 2 /2)). The proposed move is accepted with probability 

6(Sl:„,7l:p,Tl:pl^fe) 

otherwise we keep 7^ = 0. If 7^ = 1, we propose to set it to be zero, removing the corresponding r 2,fc and 
using the same expression as above to accept/reject (with the appropriate changes i.e. the proposed state 
here has fewer variables than the current model). In this proposal, we are adding or removing columns 
from our design matrix. 

Note that this algorithm is best suited for scenarios similar to the ones we investigate in this paper, 
where the number of components K > 2 and the number of data points n > 30 make the space to be 
sampled much bigger than the one for the explanatory variables. 



12 



3.4 Sampling Procedure 

The sampling procedure consists of 

• Stage I: Initialise the algorithm. Sample si :n , 7i :p , rf. p from the respective priors. Run the SMC 
algorithm, as described in section 3.1, storing all the N particles labels z\- n — z\. n , . . . , and their 
genealogy ai : „_i = (a\. n _ 1 , . . . , a(f rl _ 1 ). Sample one particle index t € {1, . . . , N} according to the 

1 N 

normalized weights W nl . . . , W n . 

• Stage II: Repeat the following steps until convergence 

1. Run the conditional SMC algorithm, as described in section 3.2. 

2. Sample t e {1, . . . , N} according to new weights , . . . , Store z\. n and b\. n . 

3. Given z\. n , update the current values of Si :n ,7i :p ,Tf. p following the MCMC steps described in 
section 3.3. 

This provides a valid MCMC algorithm with the posterior of interest as an appropriate marginal; see 
Andrieu et al. (2010). 

4 Simulation Study 

4.1 Simulation Settings 

We assume one basic scenario that we then perturb to highlight the different properties of the model and 
different important aspects of the simulation procedure. In the standard scenario the parameters of the 
model have been randomly generated from the following priors w 1 . K _ 1 ~ Vir(2) ,s k <~ (?a(2,2) ,7*. p 
Be(l/2) ,r 2 l k | 7l fe p L ^ d - £x(l/2) ,a 2 k ~ Xga(2,4) ,(3 k k \a 2 k ,T 2 l k n t p ~ Af^ , ( 0, ^diag(r 2 ^ )) and each 

il:p 'l:p 'l:p \ 'l:p / 

data point is then sampled from the mixture model. Each dataset we generate contains n = 50 paired 
observations sampled from a mixture of three components K = 3. The covariates Xi of dimension p = 20 
are sampled from a centered Gaussian distribution whose dispersion depends on the cluster membership. 
The only parameters of the simulation algorithm we need to set are: the number of particles, say 
= 100; the step length of the MCMC move for t, say v T = 2; the step length of the error update, say 
v z = 3, and also the number of repeats of the sampling procedure, say a few thousand. 

4.2 Resampling 

An important aspect of the simulation behaviour that we can partially control is the weight degeneracy. 
By introducing the adaptive resampling step we limit the risk of the empirical probability mass collapsing 
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on a single particle. We arc equally aware that resampling tends to replicate the most likely paths and 
might lead to an impoverished diversity of explored paths. This effect is marginally alleviated by limiting 
the frequency of the resampling. 

Figure 2 shows that, in our case, adaptive resampling ultimately is beneficial to preserve both path and 
weight diversity. We note in the left column that if we resample after every new observation is processed, we 
end up fairly quickly with a single path that gets replicated for all N particle. With adaptive resampling, 
on the other hand, the degeneracy of weights and paths is maintained at a tolerable level. In the right 
column, we preserve a variety of paths that might have different likelihood as shown by the more disperse 
ESS plot. Note also how in some instances no resample is performed for several runs and the number of 
particles remains stable as it is their weight. Even if at the end of every iteration of the Markov chain we 
only need to store one single particle, it is important that we are able to preserve a richer variety of paths 
and consequently a more disperse weight distribution from which we can sample. 

4.3 Clustering Accuracy 

To test the clustering accuracy of the model, we generate datasets using the simulation settings described 
in Section 4.1. We then let the algorithm run and for each iteration we save one particle that represents 
one sample from the posterior distribution of the label indicator variables. Once we have collected enough 
samples we analyse the distribution of the Adjusted Rand Index Score over the sampled paths. To deal 
with the label switching problem we permute the labelling to maximize the adjusted rand index, computed 
w.r.t. the cluster assignment associated to an external model or corresponding to the null hypothesis we 
want to test (like the macro sector partition in our real life problem). 

In Figure 3 we can see that the distribution is highly skewed towards 1, which means that most of the 
time the suggested clustering assignment perfectly matches the true clustering. In other words, given that 
the classification probability distribution we try to approximate is fairly accurate (which seemed to be the 
case on the basis of our convergence assessment) the model seems to provide a good clustering, at least in 
this example. 

4.4 Variable Selection Accuracy 

The other major point we want to investigate is the accuracy of the variable selection approach. We would 
hope that the model identifies as many informative variables as possible, and at the same time is sufficiently 
parsimonious to exclude as many as possible of the noise variables. To that end the sensitivity index is the 
ratio of the number of true variables detected to the sum of the same value added to false negatives and 
specificity index is the ratio of the true negatives to the the sum of the same value added to false positives. 
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Figure 2: Left Column: Unconditional Resampling, we resample systematically every time a new observation is fed into the 
SMC algorithm. Right Column: Adaptive Resampling, we only resample whenever the ESS falls below a fixed threshold. 
Top Row: Weight Degeneracy, measured as ESS/N, where 1 means all particles have equal weight, and means the entire 
probability mass is on one particle. Bottom Row: Path Degeneracy, measured as percentage of paths that remain different 
as we loop through the observations. Each line represents three separate repeats of the sampling procedure and darker lines 
correspond to earlier iterations. 



These are the measures used to assess the accuracy of the variable selection. 

In Figures 4 we look, as before, at the distribution over all MCMC iterations of the relevant indexes, 
in this case the sensitivity and specificity indexes. We remark that given the relatively small number of 
variables, p = 20, we should not be surprised to observe some very coarse distributions, since there are only 
so many informative or noise variables. In both plots it is evident that the overall variable selection accuracy 
is considerable. The sensitivity of the selection algorithm is fairly high, since most of the informative 
covariates are included and play a role in the regression curves. Conversely, the specificity index is equally 
good if not better, as very few noise variables are retained at all. We can explain the marginally lower 
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Figure 3: Adjusted Rand Index distribution. For every MC iteration we record the Adjusted Rand Index 
score of the proposed cluster assignment versus the true clusters labels. Where a distribution centered 
around zero would be an indication of random assignment, the observed values give evidence that the 
model is successfully assigning most of the data points to the proper cluster. 



sensitivity compared to the specificity, by noticing that the model is successfully parsimonious and achieves 
a satisfactory clustering performance even with only a smaller subset of the informative variables. 



Sensitivity Index Distribution Specificity Index Distribution 
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Figure 4: Variable Selection accuracy over all MC iterations. In the left plot we show the distribution of the 
sensitivity index, i.e. the ability of the algorithm to identify the truly informative variables. In the right 
plot the specificity index measures the accuracy of the model in isolating the non-informative variables, 
which shows that the model is very precise in excluding the noise variables. 
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5 Financial Markets Data 
5.1 Applied Problem 

In the financial literature it is common practice to group markets into macro sectors based on the type and 
nature of the good exchanged. Practitioners operating in financial markets adhere to this convention and 
consider each sector as a separate area of expertise. This approach is reasonable for fundamental investors 
who have to be knowledgeable on the underlying factors driving demand/offer and have to elaborate the 
relevant information as news become public. 

A partition of the markets that mirrors the macro sectors is less suitable to systematic traders who 
take investment decisions based on algorithms which depend only on the evolution of prices. Under these 
circumstances, developing and optimizing a quantitative strategy on a sector by sector basis seems rather 
arbitrary. This is because the only input considered when engineering the strategy is the time series of 
prices whose behaviour is not necessarily a function of the sector. A clustering method which is more 
consistent with a systematic and objective approach, should identify homogeneous clusters of markets that 
share similar price dynamics characteristics. 

Our approach starts by selecting, across all sectors, those major financial markets for which we have 
records spanning up to twenty years of trading (see the following Section). Under the assumption that all 
the relevant information about a market can be extracted from the historical prices, we then compute for 
each market the summary statistics that measure the critical features of the distribution and the temporal 
dependence of time series of returns. 

In a supervised learning framework, the statistics of the price dynamics can be seen as explanatory 
variables that can help us understand why the trading performance is different across markets. When we 
apply a trading algorithm to every market, one can observe that the risk-adjusted profit we obtain is not 
consistent across markets. The supervised model we propose should be able to regress the profitability 
of the trading algorithm on some of the features we record for each market. An unsupervised approach 
is considered in Cozzini (2012), where the relative benefits of supervised versus unsupervised learning are 
investigated. 

Assuming we achieve a more accurate partition of the markets, we are in an ideal position to develop 
a systematic trading strategy that better suits the markets within each group. A strategy that has been 
optimized on a market by market basis would likely be ovcrfittcd and would not have enough back testing 
(i.e. for testing the algorithm) data. If, instead, we devise a trading algorithm that consistently performs 
on a group of markets, we are bound to obtain a more robust and convincing result. At the same time, the 
significant features that are responsible for driving the clustering process give us an insight on the critical 
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aspects of price dynamics that should be exploited by the trading strategy. 

In order to reach credible conclusions about how to partition markets and what are the informative 
features of the price dynamics, we need a clustering method which is able to address the following issues 
characteristic of financial data: 

• Outliers and potential skewness 

• Non Informative Variables 

• Fewer Observations than Explanatory Variables. 

If we succeed in proposing a model whose performance is not hindered by these issues, we will have increased 
confidence in the trading strategy that we develop based on the outcome of clustering and variable selection 
process. By being able to implement a more targeted strategy on each group of markets, we should achieve 
better investment returns. 

5.2 The Specific Data Analyzed 

The dataset analysed has been kindly provided by AHL Research, a quantitative investment manager, and 
integrates external sources with proprietary records of live prices sampled during actual trading activity. 
The selection of markets considered covers several sectors, assets classes and regions. The details of each 
market considered are listed in Table 1. The frequency of the samples is daily, typically the end of day 
official settlement price, whenever the exchange provides one. 

For each of the n — 36 financial markets we have data for, we compute p = 13 statistics which we 
arrange in a 36 x 13 data matrix. The complete list of 13 variables analyzed are reported in Table 2; 
they comprise a variety of statistics, some descriptive of the distribution of returns such as kurtosis, some 
constructed to maintain and characterize the time-series structure of the data, such as autoregression order 
or more advanced as the Rescaled Range statistic (e.g. Hurst (1951)). It is not known which, if any, 
statistics can explain the response detailed below. 

The ultimate goal of the study is to find a more appropriate systematic trading strategy whose param- 
eters can be robustly calibrated on clusters of similar markets. To verify that the markets' return features 
we have described are related to the trading strategies we are interested in, we compute a risk adjusted 
measure of investment performance as response variable. We adopt a simple moving average crossover to 
generate buy or sell signals and target a constant risk profile by scaling positions according to a rolling 
volatility measure. Given a time series of prices {pt}f = i, and EMAi = p\, the exponential moving average 
at time t > 1 is 

EMA 4 = apt + (1 - a)EMA t _i 
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Table 1: Data. Market: Code identifies market. Type: Contract used for transaction. Cash, X, exchange 
traded futures, F or forwards C. Start: First date of daily records. CCY: Currency. 



where a represents the degree of exponential decay of the weights associated to older prices. The value 
of a determines the speed at which the exponential moving average reacts to a new recorded price and 
ultimately how close it tracks the price process. To generate our position signal we compute a fast EMA^ F ' 
and a slow EMA( S ) by fixing = {0.03} and — {0.01} respectively. A buy signal is then generated 
every time the fast moving average crosses from below the slow moving average; conversely, a sell signal is 
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Rank 
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Standard Deviation of daily returns 
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Skewness of daily returns 
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Kurtosis of Daily Returns distribution 
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Estimated degrees of freedom for a fitted t distribution 
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Estimated tail shape parameter for a fitted Generalised Pareto distribution 
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7 


autoq 


Box-Pierce Q-statistic from autocorrelation coefficients 


8 


box2 


Ljung-Box test statistic at lag 2 
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Fractal dimension estimate using Whittle's method 
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Rescaled Range Statistic 
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Estimated Hurst exponent using waveletFit from fArma R package 
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Generalised Hurst Exponent 



Table 2: Explanatory Variables. 



given when it crosses from above. The position, pos, is then held proportional to the difference between 
the two moving averages, 

pos t = (EMAf } - EMA<: S) )/vol t 

where vol is a measure of the rolling volatility (see Cozzini (2012) for the details on how it is computed). 
The point of scaling the position proportionally to the volatility of the price process, is to automatically 
adjust the risk of our exposure to the perceived uncertainty of the market. In practice, when the volatility 
increases we would scale down our positions. The annualised Sharpe Ratio (Sharpe, 1966) measures 
the average return per unit of risk and it is computed from the sequence of the daily profits and losses 

n = pos x _ t x (p t -pt-i) 

250/TELn 

^/250Var(r) 

where 250 is the number of working days in a year. This is used as our response variable. 
5.3 Data Analysis 

We now apply the penalised mixture of t distributions to our financial markets data. We fit the Bayesian 
mixture of Lasso regressions, propose a new clustering of the financial markets and compare the results to 
the original macro sector partition. In the present implementation of the mixture of regressions we will 
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assume the number of components is K = 2. We remark that we have considered the analysis with K > 2, 
but did not find any qualitative improvements in the data analysis. 

To fit the model we follow the sampling procedure described in Section 3.4 and execute 10000 iterations 
of the PMCMC algorithm with adaptive resampling. The acceptance rate of the updates for r, s were 
0.23, 0.24 respectively, which are sensible values. We should remark that to deal with the label switching 
problem, common to many Bayesian mixtures, at each iteration we order and relabel the clusters according 
to the average Sharpe Ratio of each cluster. 

In Table 3 we quote the relative selection frequency of each variable across all PMCMC iterations. We 
compute the average of the 7 indicator vector for each cluster over the 10000 iterations we run. According 
to the evidence, it appears that markets in cluster A are characterized by some specific features that are 
well represented by the Generalised Hurst Exponent (Di Matteo et al. 2004), the simple Hurst exponent 
computed using wavelet method and the Kurtosis. Since the markets in cluster A on average show a 
better Sharpe Ratio, we can conjecture an immediate link between the performance of the trend following 
strategies we tested and the persistence of markets as captured by these non-linear statistics as we can see 
in Figure 7. On the other hand, the boxplots on the right column of Figure 7 seems to suggest that the 
markets in cluster B are characterized by a lower kurtosis and lower autocorrelation as measured by the 
Box-Pierce Q-statistic. 

From the sampled posterior distribution of the label indicator we can infer how likely two markets 
belong to the same cluster. The relative frequency, over all PMCMC iterations, of the event that the two 
markets were assigned to the same cluster was used as a (posterior) measure that the two markets belong 
to the same cluster. Based on this, we can compute a distance that ranges between zero, if the markets are 
always in the same cluster, or one, if the markets are always assigned to distinct clusters. This approach 
allows us to compute a dissimilarity matrix between all markets and use this information to propose a 
hierarchical clusters as shown in Figure 6. 

From the dendrogram in Figure 6 we can obtain a hard cluster assignment under the assumption K = 2. 
The clustering results are reported in Table 4. We can see how some markets that belong to the same macro 
sectors are kept together. For example, all interest rates markets are assigned to cluster A, while cluster B 
comprise all agriculturals and currency markets. Some other sectors are fairly evenly split between the two 
mixture components, this is the case for bond and stock markets. Still, in some cases, we can see a pattern 
where for example the more liquid stock markets are assigned to cluster B and the emerging markets, with 
the exception of German Dax, belong to cluster A. 

As a further check of the relevance of the clustering proposed in Table 4 we can observe the value of 
the response variable y for each cluster. In Figure 7 we note a certain differentiation of the Sharpe Ratio 
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0.41 


13 


wavH 


0.31 



Table 3: Ranking of the variables according to the frequency they have been selected to model Cluster A 
and Cluster B. 

between the two clusters, with cluster A markets generally showing a better performance of the trading 
strategy. 

6 Summary 

We have considered a Bayesian mixture of lasso regressions with t— errors, designed for a financial data 
analysis problem. We applied a PMCMC algorithm to sample from a marginalized posterior and inves- 
tigated the model and algorithm on simulated and real data. In our real data example, we saw that the 
clusters returned by our model could perform better in terms of profitability on an in-sample basis, than 
the physical clustering of the market. There are many issues that can be investigated in future work. 

Firstly, with regards to the theoretical properties of the model. We did not investigate, for example, 
the issue of Lindley's paradox, which can manifest itself in mixtures (e.g. Jennison (1997)). That is, 
we would like to know if there are some combinations of prior parameters, which would lead one to 
favouring statistical models with a single component. In connection to this, whether the complex posterior 
also satisfies a collection of inequalities for model probabilities as is the case for some standard Bayesian 
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Figure 5: Boxplots of top ranking selected variables by cluster. 



mixtures; see Nobile (2005). 

Secondly, the computational procedure of selecting the number of components. There are at least two 
options which we intend to consider in future work. The first is simply to use our PMCMC algorithm in each 
model. Then, as one can easily obtain a marginal likelihood estimate (indeed using the proposed particles - 
'all the samples' - see Andrieu et al. (2010)) and compute Bayes factors - see e.g. Nobile (1994). The second 
idea is to build a trans-dimensional sampler based upon PMCMC and SMC samplers (Del Moral et al. 2006). 
Here, one uses a trans-dimensional version of the PMMH sampler. Suppose one has a target density irk {%) in 
dimension k and our overall target density is: ir(k, x) oc iTk(x)p{k) x £ X k k G {1, . . . , k max } = 1C where 
p{k) is a prior on the dimension (here the number of components in the mixture). Thus we have defined a 
target density on {J ke!C {k} x X k . Now introduce a sequence of targets of dimension k: 7Tfc.„(x) cx ^(x) 7 " 
where < 71 < • • ■ < j p for some p > 1 given. Our trans-dimensional proposal is as follows: given a model 
order fc, propose a model order k' and use an SMC sampler to simulate the sequence i^k' ,n- The acceptance 
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Cluster Dendrogram from MC Iterations 
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Figure 6: Hierarchical clustering based on the relative frequency, over all PMCMC iterations, markets are 
assigned to the same clusters. 

probability, when resampling at each time-point of the SMC algorithm, of such a move is: 

U P n =i^EZiK,k' P{k')q{k\k') 

where q(k'\k) is the proposal density of moving from k to k' and lEili 1 "^' i s tnc marginal 

likelihood estimate from the SMC sampler in dimension k' . This allows one a possibility of producing very 
competitive trans-dimensional proposals. 
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A 


TNC 


BOND 


0.53 


A 


PTL 


ENERGY 


0.52 


A 


EDC 


IRATE 


0.77 


A 


ARS 


IRATE 


0.34 


A 


EUL 


IRATE 


0.59 


A 


EYT 


IRATE 


0.67 


A 


SSL 


IRATE 


0.96 


A 


CPN 


METAL 


0.59 


A 


HSH 


STOCK 


0.53 


A 


TWS 


STOCK 


0.73 


A 


KIS 


STOCK 


0.35 


A 


DXF 


STOCK 


0.42 


B 


WHC 


AGS 


0.56 


B 


SBC 


AGS 


0.36 




SGN 


AGS 


0.36 


B 


CCN 


AGS 


35 


B 


CFN 


AGS 


0.45 


B 


GTL 


BOND 


0.51 


B 


ABS 


BOND 


0.7 


B 


RBN 




0.48 


B 


CLN 


ENERGY 


0.9 


B 


HON 


ENERGY 


0.91 


B 


UKUS 


CURRENCY 


0.54 


B 


ADUS 


CURRENCY 


0.22 


B 


EUYN 


CURRENCY 


0.37 


B 


SFUS 


CURRENCY 


0.57 


B 


CDUS 


CURRENCY 


0.26 


B 


CLN 


METAL 


0.22 


B 


ADL 


METAL 


0.5 


B 


SLN 


METAL 


0.01 


B 


NKS 


STOCK 


0.38 


B 


ESTF 


STOCK 


0.29 


B 


FTL 


STOCK 


0.52 


B 


ESPC 


STOCK 


0.29 



Table 4: 



Cluster assignment assuming K = 2. 



